Brian S. Evans, Ph.D.
Migratory Bird Center
Smithsonian Conservation Biology Institute
# Load RCurl library:
library(RCurl)
# Load a source script:
script <-
getURL(
"https://raw.githubusercontent.com/bsevansunc/workshop_languageOfR/master/sourceCode.R"
)
# Evaluate then remove the source script:
eval(parse(text = script))
rm(script)
Why would you use for loops? Let’s look at a common example. You have been asked to calculate the mean petal length of three Iris species: Iris setosa, Iris versicolor, and Iris virginica (coded as the factor levels setosa, versicolor, and virginica in the species field of the irisTbl data frame). Using the mean function and the tools that we have addressed thus far, the following calculates the mean petal length for setosa:
# Filter irisTbl to setosa:
irisTbl[irisTbl$species == 'setosa', ]
# Extract the petalLength field (column):
irisTbl[irisTbl$species == 'setosa', ]$petalLength
# Calculate the mean of petal lengths:
mean(irisTbl[irisTbl$species == 'setosa', ]$petalLength)
Exercise One:
Calculate the mean petal length of each of the Iris species using matrix notation (as above) and a custom function.
The code you generated above is very repetitive – the only change in each of the lines you should have created above was the name of the species. Writing code like this makes your scripts unnecessarily long and prone to user input error. For loops should be used whenever a chunk of code is repeated more than twice.
We’ll return to the Iris example a bit later, but for now let’s review indexing.
Consider the following numeric vector, v:
| [1] | [2] | [3] | [4] | [5] |
|---|---|---|---|---|
| 1 | 1 | 2 | 3 | 5 |
Vector v is an R object comprised of five numbers.
# Explore vector v:
v
class(v)
str(v)
length(v)
Each value in a vector has a position, denoted by “[i]”.
Recall: v[i] is the value of v at position i.
# Explore vector v using indexing:
i <- 3
v[i]
v[3]
v[3] == v[i]
Because v[i] is a value, it can be used in a calculation:
# Add 1 to the value of v at position three:
i <- 3
v[3] + 1
v[i] + 1
We would like to modify the values in vector v by adding one to each value. This might be written mathematically as:
\[V_{new, i} = V_{i} + 1\]
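To do so, we will write a for loop. Writing proper for loops requires following these three steps: 1) defining an output object, 2) writing the for loop sequence statement, and 3) writing the body statement.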
For loop development begins by defining an output object of a given length that your for loop will write to. This step is very important – your for loop will run much, much slower if you do not do so!
Vector objects are defined as follows:
# Define a vector for output:
vNew <- vector('numeric', length = length(v))
str(vNew)
The first argument of the vector function is the type of object you would like to create. The next argument sets the length of the created object equal to that of v.
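To illustrate the first argument, the sketch below creates empty vectors of three common types and shows the default value that fills each one. The object names are hypothetical and are used only for this demonstration.

```r
# Hypothetical object names, used only to illustrate vector():
emptyNumeric <- vector('numeric', length = 3)
emptyCharacter <- vector('character', length = 3)
emptyList <- vector('list', length = 3)

# Each type is filled with its own "empty" default value:
emptyNumeric    # zeros
emptyCharacter  # empty strings
emptyList       # NULL elements
```

The output type should match what the for loop body will produce, for example 'numeric' for means and 'list' for data frames.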
Values of vector vNew will be replaced location-by-location. For example, let’s compare the initial value of v with the resultant value of v + 1 at position 3:
# Explore filling values of vNew by index:
i <- 3
v[i]
vNew[i] <- v[i] + 1
vNew[i]
v[i] + 1 == vNew[i]
The utility of the for loop is that we can calculate the above for each position (i) in vector v. This is done by setting the “for loop sequence” statement which defines the locations over which the for loop will be calculated. The for loop sequence for locations one through five is written as (DO NOT RUN):
# for(i in 1:5)
The above can be translated as “for position i in positions one through five”.
To make our code as flexible as possible, we generally do not want to have to directly type in the stopping point of the for loop. We can use 1:length(v) or the function seq_along to specify the for loop sequence. Run the following and note the behavior:
v
1:5
1:length(v)
seq_along(v)
The for loop sequence statement can then be written as (DO NOT RUN):
# Example for loop sequence statements:
# for(i in 1:length(v))
# for(i in seq_along(v))
The above statements are more flexible than providing a numeric stopping point for your for loop.
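The two statements differ in one case that is worth knowing about: a vector of length zero. The sketch below uses a hypothetical empty vector, emptyVector, to show that 1:length() counts down from one to zero while seq_along() returns an empty sequence.

```r
# A hypothetical empty vector:
emptyVector <- numeric(0)

# 1:length() counts down from 1 to 0, so a loop would run twice:
badSequence <- 1:length(emptyVector)
badSequence

# seq_along() returns a zero-length sequence, so a loop would not run:
goodSequence <- seq_along(emptyVector)
goodSequence

# Count the iterations for each sequence:
nBad <- 0
for(i in badSequence) nBad <- nBad + 1
nGood <- 0
for(i in goodSequence) nGood <- nGood + 1
c(nBad, nGood)
```

For this reason, seq_along is often the safer choice when the input might be empty.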
The for loop sequence statement is followed by the “body” statement that tells R what to do during each iteration of the loop. The body associated with our “add one” formula is (DO NOT RUN):
# vNew[i] <- v[i] + 1
If your for loop spans multiple lines, place the body within curly brackets, for example (DO NOT RUN):
# for(i in 1:length(x)){
# body
# }
Putting together our steps of: 1) Creating an output object, 2) The for loop sequence statement, and 3) The body statement, our completed for loop is written as follows (RUN THIS ONE):
# First for loop:
vNew <- numeric(length = length(v))
for(i in seq_along(v)){
vNew[i] <- v[i] + 1
}
Take a look at the output and compare it with the values of v:
# Explore first for loop output:
vNew
v
vNew == v
Exercise Two:
\[y = mx + b\]
- Convert the above mathematical formula to a function with arguments m, b, and x.
- Generate a sequential vector of values containing all integers from 1-10. Assign the name x to the vector object.
- Use a for loop and the function above to calculate values of y where: m = 0.5, b = 1.0, and x refers to the vector x above (Note: A for loop is not really required here).
You may have noticed that none of the operations we completed in the previous section actually required for loops (for example, v + 1 is calculated for each value of v by default). For loops are predominantly used to split-apply-combine data. In data science, split-apply-combine problems are those that relate to situations in which you seek to split a dataset into multiple parts, apply a function to each part, and combine the resulting parts. For loops can be a great tool for split-apply-combine problems.
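As a quick check of that claim, the sketch below compares the "add one" for loop with the vectorized calculation. The values of v are taken from the table earlier in this lesson; vVectorized is a name introduced here for illustration.

```r
# Vector v, as shown in the table earlier in this lesson:
v <- c(1, 1, 2, 3, 5)

# For loop version:
vNew <- numeric(length = length(v))
for(i in seq_along(v)){
  vNew[i] <- v[i] + 1
}

# Vectorized version:
vVectorized <- v + 1

# Both approaches give the same values:
all(vNew == vVectorized)
```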
We will summarize the iris data frame to illustrate using a for loop for a typical split-apply-combine problem. Calculating the mean for each species without a for loop required the following code:
# Mean petal lengths of Iris species without a for loop:
mean(irisTbl[irisTbl$species == 'setosa', ]$petalLength)
mean(irisTbl[irisTbl$species == 'versicolor', ]$petalLength)
mean(irisTbl[irisTbl$species == 'virginica', ]$petalLength)
We can use a for loop to calculate this value across species. To do so, we first need to create a vector of species:
# Make a vector of species to loop across:
irisSpecies <- levels(irisTbl$species)
irisSpecies
Next, we’ll create an empty vector to store our output:
# For loop output statement:
petalLengths <- vector('numeric',length = length(irisSpecies))
petalLengths
The vector of Iris species will define the bounds of our for loop sequence (DO NOT RUN):
# For loop sequence:
# for(i in seq_along(irisSpecies))
Split: To construct the for loop body, we’ll start by splitting the data:
# Exploring the iris data, subsetting by species:
i <- 3
irisSpecies[i]
irisTbl[irisTbl$species == irisSpecies[i], ]
# Split:
iris_sppSubset <- irisTbl[irisTbl$species == irisSpecies[i], ]
Apply: Modification of the data:
# Calculate mean petal length of each subset:
mean(iris_sppSubset$petalLength)
The completed for loop is constructed as:
# Make a vector of species to loop across:
irisSpecies <- levels(irisTbl$species)
# For loop output statement:
petalLengths <- vector('numeric',length = length(irisSpecies))
# For loop:
for(i in seq_along(irisSpecies)){
iris_sppSubset <- irisTbl[irisTbl$species == irisSpecies[i], ]
petalLengths[i] <- mean(iris_sppSubset$petalLength)
}
Combine: Combining the for loop output can be done in a number of ways. Below, we bind our results in a tidy tibble data frame, using the data_frame function.
# Make a tibble data frame of the for loop output:
petalLengthFrame <- data_frame(species = irisSpecies, meanPetalLength = petalLengths)
petalLengthFrame
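For comparison, base R can carry out the same split-apply-combine steps in a single call to tapply. This is a minimal sketch, not the method used in this lesson. It assumes a stand-in for irisTbl (irisStandIn, a hypothetical name) built from R's built-in iris data, with the species and petalLength fields named as above.

```r
# Stand-in for irisTbl, built from R's built-in iris data (an assumption):
irisStandIn <- data.frame(
  species = iris$Species,
  petalLength = iris$Petal.Length
)

# Split petalLength by species, apply mean, and combine the results:
meanBySpecies <- tapply(irisStandIn$petalLength, irisStandIn$species, mean)
meanBySpecies
```

The result is a named set of means, one per species, which should match the values stored by the for loop.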
Exercise Three:
Use a for loop and the birdHabits data frame to calculate the number of species in each diet guild.
For loops can be used to perform powerful data queries. To illustrate how this works, we will use the birdCounts data frame to calculate the total number of omnivorous birds that were observed at each of the sites. Take a moment to explore the data:
# Explore the bird count data:
head(birdCounts)
str(birdCounts)
# Explore the bird trait data:
head(birdHabits)
str(birdHabits)
Before we construct a for loop to address this problem, we should explore how to sum omnivore counts for a single site. Let’s start by addressing how we might subset birdCounts to omnivores. The code below calculates the total omnivore count across all sites.
# Extract vector of omnivorous species:
omnivores <- birdHabits[birdHabits$diet == 'omnivore',]$species
# Subset the counts to omnivores:
birdCounts[birdCounts$species %in% omnivores, ]$count
# Calculate the sum of counts:
sum(birdCounts[birdCounts$species %in% omnivores, ]$count)
Recall that subsetting birdCounts to a single site (example site == apple) can be accomplished using:
birdCounts[birdCounts$site == 'apple', ]
To calculate omnivore counts for a given site, we use & to subset by the omnivore and site logic statements (example site == apple):
# Subset the omnivore counts to site apple:
birdCounts[birdCounts$species %in% omnivores &
birdCounts$site == 'apple', ]
# Extract the count column:
birdCounts[birdCounts$species %in% omnivores &
birdCounts$site == 'apple', ]$count
# Calculate the sum:
sum(birdCounts[birdCounts$species %in% omnivores &
birdCounts$site == 'apple', ]$count)
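To see how the two logic statements combine, it can help to look at each one as its own logical vector. The sketch below uses small, hypothetical toy data frames (toyHabits and toyCounts, with made-up species codes and counts) rather than the workshop data.

```r
# Hypothetical toy data (not the workshop data):
toyHabits <- data.frame(
  species = c('amro', 'grca', 'noca'),
  diet = c('omnivore', 'omnivore', 'granivore'),
  stringsAsFactors = FALSE
)
toyCounts <- data.frame(
  site = c('apple', 'apple', 'apple', 'pear', 'pear'),
  species = c('amro', 'grca', 'noca', 'amro', 'noca'),
  count = c(3, 2, 5, 4, 1),
  stringsAsFactors = FALSE
)

# Vector of omnivorous species:
toyOmnivores <- toyHabits[toyHabits$diet == 'omnivore', ]$species

# Each condition is a logical vector with one value per row:
isOmnivore <- toyCounts$species %in% toyOmnivores
isApple <- toyCounts$site == 'apple'

# & is TRUE only where both conditions are TRUE:
isOmnivore & isApple

# Sum of omnivore counts at site apple:
appleOmnivores <- sum(toyCounts[isOmnivore & isApple, ]$count)
appleOmnivores
```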
Exercise Four:
Using the birdHabits and birdCounts data frames, modify the function below such that it will calculate the number of species of a given guild at a selected site.
richnessSiteGuild <- function(site, guild){
  guildSpp <- birdHabits[birdHabits$foraging # COMPLETE
  countSppSubset <- birdCounts[birdCounts$ # COMPLETE
  countSppSiteSubset <- countSppSubset[# COMPLETE
  nSpp <- # COMPLETE
  return(nSpp)
}
richnessSiteGuild('apple', 'ground')
Using the code we generated prior to Exercise Four, we should be able to develop a for loop to count omnivores at each site. Our first goal will be to get a vector of birds that are omnivores in the birdHabits data frame:
# Extract vector of omnivorous species:
omnivores <- birdHabits[birdHabits$diet == 'omnivore',]$species
Split: To evaluate the number of omnivores per site, we split the data into individual sites. To do so, we generate a vector of unique sites and query the data frame by the site at a given position in the site vector.
# Generate a vector of unique sites:
sites <- unique(birdCounts$site)
# Site at position i:
i <- 3
sites[i]
# Subset data:
birdCounts_siteSubset <- birdCounts[birdCounts$site == sites[i],]
birdCounts_siteSubset
We then use %in% to subset to just omnivores and $ to subset to just the count field:
# Just a vector of omnivore counts:
countVector <-
birdCounts_siteSubset[birdCounts_siteSubset$species %in%
omnivores,]$count
Apply: We then apply a function to each individual part.
# Get total number of omnivores at the site:
nOmnivores <- sum(countVector)
Combine: Output may be combined together as we have with previous for loop statements in this lesson. In the code below, I have combined the sites and outVector vectors into a tibble data frame, with outVector stored in the nOmnivores column:
# Generate a vector of unique sites:
sites <- unique(birdCounts$site)
outVector <- vector('numeric', length = length(unique(sites)))
for(i in seq_along(sites)){
# Split:
birdCounts_siteSubset <- birdCounts[birdCounts$site == sites[i],]
countVector <- birdCounts_siteSubset[birdCounts_siteSubset$species %in% omnivores, ]$count
# Apply:
outVector[i] <- sum(countVector)
}
# Combine:
data_frame(site = sites, nOmnivores = outVector)
An alternative to combining vectors that I often find useful is combining a list of data frames using the tidyverse function bind_rows:
# Generate a vector of unique sites:
sites <- unique(birdCounts$site)
outList <- vector('list', length = length(unique(sites)))
for(i in seq_along(sites)){
birdCounts_siteSubset <- birdCounts[birdCounts$site == sites[i],]
countVector <- birdCounts_siteSubset[birdCounts_siteSubset$species %in% omnivores, ]$count
outList[[i]] <- data_frame(site = sites[i], nOmnivores = sum(countVector))
}
# Combine:
bind_rows(outList)
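If the tidyverse is not available, a similar result can be obtained in base R with do.call and rbind. This is a minimal sketch under stated assumptions: toyCounts, toySites, toyList, and toyTotals are hypothetical names, and the counts are made up for illustration.

```r
# Hypothetical toy data (not the workshop data):
toyCounts <- data.frame(
  site = c('apple', 'apple', 'pear', 'pear'),
  count = c(3, 2, 4, 1),
  stringsAsFactors = FALSE
)

# Output list with one slot per site:
toySites <- unique(toyCounts$site)
toyList <- vector('list', length = length(toySites))

# Store a one-row data frame for each site:
for(i in seq_along(toySites)){
  siteSubset <- toyCounts[toyCounts$site == toySites[i], ]
  toyList[[i]] <- data.frame(
    site = toySites[i],
    total = sum(siteSubset$count),
    stringsAsFactors = FALSE
  )
}

# Combine the list of data frames by row:
toyTotals <- do.call(rbind, toyList)
toyTotals
```

Storing a data frame per iteration is convenient when each iteration returns more than one value, because every column travels with its site label.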
Exercise Five:
Using the richnessSiteGuild function you created in Exercise Four and the birdHabits and birdCounts data frames, modify the for loop code below to count the number of species that are ground foragers at each site.
sites <- unique(# COMPLETE
outList <- vector('list', length = # COMPLETE
for(i in # COMPLETE
  outList[[i]] <- data_frame(site = sites[i], # COMPLETE
}
bind_rows(# COMPLETE
We may also be interested in using a for loop to generate a vector of numbers based on some mathematical function. For example, suppose we start with a value, 10, and want to double it at each step, following the mathematical expression:
\[n_t = 2(n_{t-1})\]
Let’s start by creating a vector of length 5 for our output:
# For loop output statement:
n <- vector('numeric', length = 5)
n
We must first start with a “seed” value for our vector. This is the initial value of vector n.
# Setting the seed value:
n[1] <- 10
n
Let’s calculate the first five values in this series. Because our for loop starts with our seed value, we are only interested in the second through fifth positions of this vector. Thus, our for loop sequence statement is (DO NOT RUN):
# For loop sequence:
# for(i in 2:length(n))
For each iteration, the following body statement will be evaluated (the example is at position 2):
# Exploring the construction of the for loop body:
i <- 2
n[i]
n[i-1]
n[i] <- 2*n[i-1]
n
And our complete for loop statement becomes:
# Output:
n <- vector('numeric', length = 5)
# Seed (assign to position one so that n keeps its length):
n[1] <- 10
# For loop:
for(i in 2:length(n)){
  n[i] <- 2*n[i-1]
}
n
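One way to check a loop like this is to compare it with a closed-form version of the same series. Doubling a seed of 10 at each step gives \(n_t = 10 \times 2^{t-1}\), so the sketch below computes the series both ways. The name nClosed is introduced here for illustration.

```r
# Output and seed:
n <- vector('numeric', length = 5)
n[1] <- 10

# For loop:
for(i in 2:length(n)){
  n[i] <- 2 * n[i - 1]
}

# Closed-form version of the same series:
nClosed <- 10 * 2^(0:4)

# Compare the two results:
n
nClosed
all(n == nClosed)
```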
Exercise Six:
One of my favorite for loops was created by Leonardo Bonacci (Fibonacci). He created the first known population model, from which the famous Fibonacci number series was created. He described a population (N) of rabbits at time t as the sum of the population at the previous time step plus the time step before that:
\[N_t = N_{t-1} + N_{t-2}\]
- Use the combine function to create a seed vector of two values, zero and one.
- Use the formula above and your seed vector to generate the first 20 numbers of the Fibonacci number sequence. Hint: The for loop sequence will begin at the third position; it will NOT include all of 1:20!
We specify the length of the output vector in advance so that R can set aside the space it needs once; without this, the output must be regrown on each iteration, and for loops can become very slow and memory hungry when running over large datasets.
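The sketch below contrasts the two approaches. It is illustrative only: nValues is a hypothetical number of iterations, and the object names grown and preallocated are introduced here for the comparison.

```r
# Hypothetical number of iterations:
nValues <- 10000

# Growing the output one value at a time:
grown <- numeric(0)
for(i in 1:nValues){
  grown <- c(grown, i + 1)
}

# Writing into an output vector of known length:
preallocated <- vector('numeric', length = nValues)
for(i in 1:nValues){
  preallocated[i] <- i + 1
}

# Both approaches return the same values:
identical(grown, preallocated)
```

Wrapping each loop in system.time() is one way to compare how long the two approaches take on your own machine; the difference generally becomes more noticeable as nValues increases.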
The following two sections of code are equivalent, but the latter is much easier to read. As writing code is both a tool and a method of communication, you should ensure that your code is as readable as possible.
for(i in 1:length(v)) vNew[i] <- v[i] + 1
for(i in 1:length(v)){
vNew[i] <- v[i] + 1
}
When writing for loops, it is necessary to ensure that the loop is doing what you expect it to do. A simple way to ensure that this is the case is to specify i and run the instructions on just that value. For example, to observe the behavior of the for loop at position 3:
i <- 3
vNew[i] <- v[i] + 1
vNew[i]
v[i]